James Griffin

Written Homework 2

#1

Show that, using the L2 norm, steepest descent and gradient descent are equivalent. In other words, show that for a twice continuously differentiable function f(x), −(1/‖∇f‖₂) ∇f = argmin{ v·∇f : ‖v‖₂ = 1 }.
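As a numerical sanity check (not a proof), one can sample random unit vectors and verify that v·∇f is never smaller than it is at v = −∇f/‖∇f‖₂. The objective and evaluation point below are arbitrary choices for illustration.

```python
import numpy as np

# Illustrative objective: f(x, y) = x**2 + 3*y**2, with gradient (2x, 6y).
# Evaluate the gradient at the arbitrary point (1.0, 0.5).
grad = np.array([2 * 1.0, 6 * 0.5])

# Candidate minimizer: the normalized negative gradient direction.
v_star = -grad / np.linalg.norm(grad)

# Sample many random unit vectors and compare v . grad against v_star . grad.
rng = np.random.default_rng(0)
samples = rng.normal(size=(10_000, 2))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)

best_sampled = (samples @ grad).min()
print(v_star @ grad)   # equals -||grad||_2, the claimed minimum value
print(best_sampled)    # no sampled unit vector does better
```

The key identity behind the check: v·∇f = ‖∇f‖₂ cos θ over unit vectors v, which is minimized when v points opposite the gradient.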

#2

Compute the Hessian for the following functions.
#3

We have discussed first and second order optimization methods that include the gradient and Hessian of the objective (loss) function. Does it make sense to also consider the third derivative (for now just consider a scalar function)? Why or why not?

-The main reason is that Newton's method is a second-order approximation method. A third-order method can converge cubically, but it requires finding roots of cubic polynomials, which is much harder than applying the quadratic formula; because the quadratic case is simpler, we generally do not pursue third-order convergence. There is also the intrinsic meaning to consider: a zero of the second derivative marks an inflection point of the original function, a definite feature, with the only remaining question being whether a critical point is a local or global extremum. A zero of the third derivative gives the same kind of information, but about the first derivative rather than the function itself, so using it to learn about the original function is more convoluted.
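For reference, the second-order (Newton) update the answer refers to can be sketched in one dimension: each step minimizes the local quadratic model, so no cubic root-finding is needed. The example function and starting point are arbitrary illustrations.

```python
# Newton's method in 1D: iterate x <- x - f'(x)/f''(x).
# Illustrative function: f(x) = x**4 - 3*x**2 + x (an arbitrary choice).

def fp(x):
    """First derivative f'(x)."""
    return 4 * x**3 - 6 * x + 1

def fpp(x):
    """Second derivative f''(x)."""
    return 12 * x**2 - 6

x = 2.0  # arbitrary starting point
for _ in range(10):
    # The quadratic model's minimizer is found by a single division --
    # this is why second order is cheap compared to a cubic model.
    x -= fp(x) / fpp(x)

print(x, fp(x))  # converges to a critical point, so f'(x) is ~ 0
```

The loop settles on a critical point where f''(x) > 0, i.e. a local minimum of the illustrative function.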
#4

Following the previous problem, is it reasonable to consider information obtained from the fourth derivative in deriving an even higher order optimization method? Why or why not? If it does make sense, derive the update formula in one dimension.

-There is some information to be gained from the fourth derivative; however, most of it is already encapsulated in the second derivative, and finding the roots of a fourth-order polynomial also introduces a lot of noise. Additionally, with every derivative you take, you lose information about the original function, so while some information can be gained, it does not make much sense to base optimization methods on the fourth derivative.
#5

There are cases where the Newton optimization method does not work well. Derive a condition on the second derivative of the objective (loss) function f (x) when the Newton method will do worse than pure gradient descent. What does this condition mean geometrically?